Computational Analysis of single-cell RNA and ATAC sequencing data

Introduction

This anlaysis was doen to understand the changes in chromatin accessibility and gene expression in young and old hematopoietic stem cells (HSCs). The data was obtained from GEO database (GSE57236). The data was processed using Seurat and Signac packages in R. The focus was on the transcriptional and chromatin accessibility landscapes in young and aged HSCs,, which could provide insights into age-related changes in stem cell function.

How chromatin accessibility correlates with gene expression:

Chromatin accessibility influences gene regulation by modulating transcription factor binding and RNA polymerase recruitment. By integrating ATAC-seq and RNA-seq data, we can assess how changes in chromatin structure impact gene expression.

knitr::opts_chunk$set(
warning = FALSE,
message = FALSE
)

Methodology

loading libraries

#install.packages('BiocManager')
#install.packages('hdf5r') #need to read h5 files
#install.packages('Seurat')
#install.packages("Signac") #seurat add-on for analyzing chromatin (ATAC-seq)
library(Signac)
library(Seurat)

Data

The data used for this analysis is publicly available on GEO under the accession GSE190424. The study provides high-resolution single-cell datasets, allowing detailed insights into HSC biology.

Research presented in the paper: “Epigenetic traits inscribed in chromatin accessibility in aged hematopoietic stem cells” by Itokawa et al., published in Nature Communications (2022). DOI: 10.1038/s41467-022-30374-9 | PMID: 35577813

ATAC-seq peak count matrix and single-cell metadata were loaded using Read10X_h5 function

# load dataset
counts <- Read10X_h5(filename = "~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5")

Loading the meta data

# Create ChromatinAssay object for ATAC-seq data processing
meta <- read.csv(
file = '~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_singlecell.csv.gz',
header = TRUE,
row.names = 1)
chrom_assay <- CreateChromatinAssay(
  counts = counts,
  sep = c(":", "-"), # Genomic coordinate format
  genome = 'mm10', # Mouse genome reference
  fragments = '~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_fragments.tsv.gz',
  min.cells = 10, # Filter peaks present in <10 cells
  min.features = 200 # Filter cells with <200 peaks
)
# Create Seurat object with chromatin assay
data <- CreateSeuratObject(
  counts = chrom_assay,
  assay = "peaks",
  meta.data = meta
)
# data[[]]

Genome Annotation

# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("EnsDb.Mmusculus.v79")
# BiocManager::install("GenomeInfoDb") #translation between chromosome names
# BiocManager::install("biovizBase")
library(GenomeInfoDb)
library(EnsDb.Mmusculus.v79) # mouse genome annotation
annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
seqlevelsStyle(annotations) <- 'UCSC' # chromosome naming 
Annotation(data) <- annotations # add genome annotation to Seurat object
# calculate nucleosome signal position
data <- NucleosomeSignal(object = data) #fragment ratio 147-294: <147
#calculate TSS enrichment
data <- TSSEnrichment(object = data, fast = FALSE)
#calculate blacklist ratio
data$blacklist_ratio <- data$blacklist_region_fragments / data$peak_region_fragments

# data[[]]
#calculate percentage of reads in peaks
data$pct_reads_in_peaks <- data$peak_region_fragments / data$passed_filters * 100 
# Visualize data quality metrics usign violin plot 
VlnPlot(
  object = data,
  features = c('peak_region_fragments', 'pct_reads_in_peaks', 
                'blacklist_ratio', 'nucleosome_signal', 'TSS.enrichment'),
  pt.size = 0.1,
  ncol = 5
)

# Filter cells based on qc filtered based on quality thresholds to remove low-fidelity data
data <- subset(
  x = data,
  subset = peak_region_fragments > 3000 &
    peak_region_fragments < 20000 &
    pct_reads_in_peaks > 15 &
    blacklist_ratio < 0.05 &
    nucleosome_signal < 4 &
    TSS.enrichment > 2
)
# filtering dynamic range of data
low_prf <- quantile(data[["peak_region_fragments"]]$peak_region_fragments, probs = 0.02)
hig_prf <- quantile(data[["peak_region_fragments"]]$peak_region_fragments, probs = 0.98)
low_prp <- quantile(data[["pct_reads_in_peaks"]]$pct_reads_in_peaks, probs = 0.02)

high_blr <- quantile(data[["blacklist_ratio"]]$blacklist_ratio, probs = 0.98)

hig_ns <- quantile(data[["nucleosome_signal"]]$nucleosome_signal, probs = 0.98)

low_ts <- quantile(data[["TSS.enrichment"]]$TSS.enrichment, probs = 0.02)
print(low_prf)
##     2% 
## 3531.7
print(hig_prf)
##   98% 
## 15713
print(low_prp)
##       2% 
## 72.68298
print(high_blr)
##        98% 
## 0.01136435
print(hig_ns)
##       98% 
## 0.8160401
print(low_ts)
##       2% 
## 4.098172

Subset data based on quality metrics to refine the cell selection using quantile thresholds

data <- subset(
  x = data,
  subset = peak_region_fragments > low_prf &
    peak_region_fragments < hig_prf &
    pct_reads_in_peaks > low_prp &
    blacklist_ratio < high_blr &
    nucleosome_signal < hig_ns &
    TSS.enrichment > low_ts
)

Normalization, dimension reduction, and clustering

The data are normalized using TF-IDF normalization. Then, dimensionality reduction is performed using Singular Value Decomposition (SVD) before UMAP is applied. Finally, graph-based clustering groups the cells based on the latent space.

data <- RunTFIDF(data)
data <- FindTopFeatures(data, min.cutoff = 'q0')
data
## An object of class Seurat 
## 165871 features across 4363 samples within 1 assay 
## Active assay: peaks (165871 features, 165871 variable features)
##  2 layers present: counts, data
data <- RunSVD(data)
#depth correlation
DepthCor(data)

#UMAP, find neighbors and perform clustering
data <- RunUMAP(object = data, reduction = 'lsi', dims = 2:30)
data <- FindNeighbors(object = data, reduction = 'lsi', dims = 2:30)
data <- FindClusters(object = data, verbose = FALSE, algorithm = 3)
DimPlot(object = data, label = TRUE) + NoLegend()

Merging Multiple Samples Multiple samples

To compare young and aged HSCs, the ATAC data from two samples are merged. A custom function import_atac is defined for data import.

# Custom function to merge and integrate data from multiple samples
import_atac <- function(count_path, meta_path, fragment_path){
  counts <- Read10X_h5(filename = count_path)
  
  meta <- read.csv(
  file = meta_path,
  header = TRUE,
  row.names = 1)
  
  
  
    chrom_assay <- CreateChromatinAssay(
    counts = counts,
    sep = c(":", "-"),
    genome = 'mm10',
    fragments = fragment_path,
    min.cells = 10,
    min.features = 200
  )
  
  data <- CreateSeuratObject(
    counts = chrom_assay,
    assay = "peaks",
    meta.data = meta
  )
  
  Annotation(data) <- annotations
  
  
  data <- NucleosomeSignal(object = data) #fragment ratio 147-294: <147  ---  mononucleosome:nucleosome-free
  
  
  data <- TSSEnrichment(object = data, fast = FALSE)
  
  data$blacklist_ratio <- data$blacklist_region_fragments / data$peak_region_fragments
  
  data$pct_reads_in_peaks <- data$peak_region_fragments / data$passed_filters * 100 
  
  low_prf <- quantile(data[["peak_region_fragments"]]$peak_region_fragments, probs = 0.02)
  hig_prf <- quantile(data[["peak_region_fragments"]]$peak_region_fragments, probs = 0.98)
  low_prp <- quantile(data[["pct_reads_in_peaks"]]$pct_reads_in_peaks, probs = 0.02)
  
  high_blr <- quantile(data[["blacklist_ratio"]]$blacklist_ratio, probs = 0.98)
  
  hig_ns <- quantile(data[["nucleosome_signal"]]$nucleosome_signal, probs = 0.98)
  
  low_ts <- quantile(data[["TSS.enrichment"]]$TSS.enrichment, probs = 0.02)
  
  data <- subset(
    x = data,
    subset = peak_region_fragments > low_prf &
      peak_region_fragments < hig_prf &
      pct_reads_in_peaks > low_prp &
      blacklist_ratio < high_blr &
      nucleosome_signal < hig_ns &
      TSS.enrichment > low_ts
  )
  
  
  
  #data <- RunTFIDF(data)
  #data <- FindTopFeatures(data, min.cutoff = 'q0')
  #data <- RunSVD(data)

  return(data)
}

Importing HSC data from various groups

young <- import_atac("~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_filtered_peak_bc_matrix.h5",
         '~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_singlecell.csv.gz',
         '~/Desktop/Projects/V1.0/single_cell/data/GSM5723631_Young_HSC_fragments.tsv.gz')
old <- import_atac(
  "~/Desktop/Projects/V1.0/single_cell/data/GSM5723632_Aged_HSC_filtered_peak_bc_matrix.h5",
  '~/Desktop/Projects/V1.0/single_cell/data/GSM5723632_Aged_HSC_singlecell.csv.gz',
  '~/Desktop/Projects/V1.0/single_cell/data/GSM5723632_Aged_HSC_fragments.tsv.gz'
)
# Assign the dataset labels to the samples
young$dataset <- "young"
old$dataset <- "old"
# Merge the two datasets and enforce unique cell names
data <- merge(young, old)
# inspect merged data object
data
## An object of class Seurat 
## 177359 features across 7792 samples within 1 assay 
## Active assay: peaks (177359 features, 0 variable features)
##  2 layers present: counts, data
# additional normalization and dimension reduction on the merged dataset
data <- FindTopFeatures(data, min.cutoff = 'q0')
data <- RunTFIDF(data)
data <- RunSVD(data)
data
## An object of class Seurat 
## 177359 features across 7792 samples within 1 assay 
## Active assay: peaks (177359 features, 177359 variable features)
##  2 layers present: counts, data
##  1 dimensional reduction calculated: lsi
# Calculate UMAP embeddings and construct the nearest neighbor graph
data <- RunUMAP(object = data, reduction = 'lsi', dims = 2:30)
data <- FindNeighbors(object = data, reduction = 'lsi', dims = 2:30)
# Perform clustering on the merged dataset at a resolution of 0.4
data <- FindClusters(object = data, verbose = FALSE, algorithm = 3, resolution = .4)
# Visualize clusters and compare groups using UMAP
DimPlot(object = data, label = TRUE) + NoLegend()

# Visualize clusters and compare groups using UMAP
DimPlot(object = data, label = TRUE, group.by = "dataset") + NoLegend()

Data analysis

we integrate gene activity analysis to determine how chromatin accessibility correlates with gene expression. The gene activity matrix is a proxy for the potential transcriptional output from regions of open chromatin. We then visualize the gene activity scores for selected genes and perform differential accessibility analysis to identify differentially accessible regions between young and aged HSCs.

gene.activities <- GeneActivity(data)
# gene activities as a new RNA assay and normalize the data
data[['RNA']] <- CreateAssayObject(counts = gene.activities)

data <- NormalizeData(
  object = data,
  assay = 'RNA',
  normalization.method = 'LogNormalize',
  scale.factor = median(data$nCount_RNA)
)
# examine the RNA assayy
data[['RNA']]
## Assay data with 21821 features for 7792 cells
## First 10 features:
##  Hnf4g, Zfhx4, Pex2, UBC, 1700008P02Rik, Pkia, Zc2hc1a, Il7,
## 1700010I02Rik, Stmn2
# Switch the default assay to RNA and visualize expression of key genes (e.g., Kit, Pecam1, Itgam)
DefaultAssay(data) <- 'RNA'


FeaturePlot(
  object = data,
  features = c('Kit', 'Pecam1', 'Itgam'),
  max.cutoff = 'q95'
)

This analysis highlights genomic regions that display significant differences in chromatin accessibility between the two conditions

DefaultAssay(data) <- 'peaks'

da_peaks <- FindMarkers(
  object = data,
  ident.1 = rownames(data[[]][data$dataset == "old",]),
  ident.2 = rownames(data[[]][data$dataset == "young",]),
  min.pct = 0.05,
  test.use = 'LR',
  latent.vars = 'peak_region_fragments'
)
da_peaks
# Annotate the differentially accessible peaks by finding the nearest gene and the distance 
da_peaks$closest_gene <-ClosestFeature(data, regions = rownames(da_peaks))$gene_name
da_peaks$distance <- ClosestFeature(data, regions = rownames(da_peaks))$distance
da_peaks

Inference from Differential Accessibility Analysis:

  • Distinct Chromatin States: The UMAP visualization shows clear separation between young and aged HSCs, indicative of distinct chromatin states.

  • Regulatory Differences: Differential accessibility results reveal specific regulatory regions that change with age. In particular, peaks with significant differences may correspond to regulatory elements controlling genes involved in stem cell function.

  • Potential Impact on Gene Expression: Integration with the RNA assay suggests that alterations in chromatin accessibility are likely influencing the expression of key genes (as visualized by the FeaturePlot of Kit, Pecam1, and Itgam).

  • Biological Relevance: The proximity of differentially accessible peaks to regulatory genes supports the hypothesis that shifts in chromatin structure contribute to the functional decline of aged HSCs.

# chromatin accessibility profile 
CoveragePlot(
  object = data,
  region = rownames(da_peaks)[2],
  extend.upstream = 10000,
  extend.downstream = 5000,
  group.by = "dataset"
)

# violin and feature plots to obtain both quantitative and spatial context of the differentially accessible peak
plot1 <- VlnPlot(
  object = data,
  features = rownames(da_peaks)[2],
  group.by = "dataset"
)
plot2 <- FeaturePlot(
  object = data,
  features = rownames(da_peaks)[2],
  max.cutoff = 'q95'
)

plot1 | plot2

Summary :

UMAP plots reveal clear clustering patterns that distinguish young hematopoietic stem cell (HSC) populations from aged ones.

Analysis of chromatin accessibility identifies specific regions that undergo significant changes with age, pointing to alterations in regulatory activity.

By integrating gene activity data with ATAC-seq results, several candidate genes emerge as potential drivers of the phenotypic differences observed between young and aged HSCs.

Conclusion :

This multiomics study integrates single-cell RNA sequencing and ATAC-seq data to provide a detailed perspective on changes in chromatin accessibility and gene expression in young versus aged HSCs. The findings highlight distinct regulatory landscapes between the two groups, shedding light on molecular mechanisms that contribute to stem cell aging.